Biobank-scale methods and projections for sparse polygenic prediction from machine learning

In this paper we characterize the performance of linear models trained via widely-used sparse machine learning algorithms. We build polygenic scores and examine performance as a function of training set size, genetic ancestral background, and training method. We show that predictor performance is most strongly dependent on size of training data, with smaller gains from algorithmic improvements. We find that LASSO generally performs as well as the best methods, judged by a variety of metrics. We also investigate performance characteristics of predictors trained on one genetic ancestry group when applied to another. Using LASSO, we develop a novel method for projecting AUC and correlation as a function of data size (i.e., for new biobanks) and characterize the asymptotic limit of performance. Additionally, for LASSO (compressed sensing) we show that performance metrics and predictor sparsity are in agreement with theoretical predictions from the Donoho-Tanner phase transition. Specifically, a future predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.63_{(0.02)}$$\end{document}0.63(0.02) and for height a correlation of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.648_{(0.009)}$$\end{document}0.648(0.009) for a Taiwanese population. This is above the measured values of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.61_{(0.01)}$$\end{document}0.61(0.01) and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.631_{(0.008)}$$\end{document}0.631(0.008), respectively, for UK Biobank trained predictors applied to a European population.

important in different groups and adjusting their specific weights 15,[18][19][20][21][22] (e.g., by using minor allele frequency differences or functional information). The complexity of this analysis clearly scales with the number of relevant features. Third, future benefits of PGS 8,11,11,16,17, rely on genotyping future participants. If this genotyping can be restricted to a small number of SNPs (e.g. as opposed to more costly WGS) it can be more cost effective to implement. Fourth, most PGS development approaches and methods use linear models. Further challenges include non-linear SNP effects (e.g., which are responsible for the difference between narrow and broad sense heritability [44][45][46][47][48][49][50], the relationship between tagged vs causal SNPs, and incorporating genome-environment interactions. Addressing these challenges again scales with the number of relevant features. In this paper we focus on the performance, detailed application, and future power of sparse algorithms, i.e. algorithms that perform feature selection, for the 11 traits listed in Table 1. Performing feature selection can help ameliorate some of the issues raised above. Sparse algorithms have previously been shown to be comparable to non-sparse methods in terms of standard metrics (e.g., area under receiver operator curve (AUC), correlation, r 2 , etc.) 11,[51][52][53] . We focus on 11 phenotypes that have been previously been shown to cover a wide range of sparsities (see Table 1): asthma, atrial fibrillation, breast cancer, coronary artery disease (CAD), hypertension, type 1 and type 2 diabetes (T1D, T2D), body mass index (BMI), direct bilirubin, height, and lipoprotein A.
As mentioned, there are many efforts to modify PGS trained in primarily European ancestries to improve performance in non-European ancestries. This is a complex endeavor that involves trait specific characteristics, allele frequency and effect size dependencies, LD structures, and more. (See for example 52,54,[54][55][56][57][58][59][60][61] for recent progress along these lines and 62 for a review of outstanding challenges). As we discuss, these improvements, while important, are unlikely to close the gap completely. To directly close this gap, large data cohorts from or including diverse ancestry groups are needed to perform de novo, ancestry specific training. This is difficult and expensive. Hence it is valuable to understand in advance what the resulting benefits will be for polygenic prediction. In this work we present a novel method for projecting and predicting the results from sparse algorithms in a novel biobank.
The main results of this paper can be summarized as follows 1. Widely-used sparse methods perform comparably, with a simple LASSO-based approach regularly achieving the best results. 2. Increased biobank/database size and access to new datasets will lead to large gains, especially for the performance of PGS in diverse ancestries. 3. We develop a novel method which predicts correlation and AUC for continuous and case/control phenotypes (respectively), with uncertainty bands, for biobank-sized datasets. This new projection method can be used to guide biobank recruitment. 4. We explore details of "phase change" behavior of compressed sensing/LASSO with increasing data size, and the corresponding the SNP content of resulting predictors.

Results
Here we present the main results of this project. Additional details concerning specific methods are found in Section "Methods" and in the Supplementary Information. The main results are obtained using UKB genotypephenotype data. Training with PRScs at times uses linkage disequilibrium (LD) information from the 1000 Genomes Project 1 (1kg). Projections are given for de novo training in TPMI and AoU. We refer to ancestry groupings European (EUR), South Asian (SAS), East Asian (EAS), African (AFR), and American (AMR). These labeling conventions come from 63 .
Comparison of sparse predictors. We compare the performance of several sparse methods: LASSO, Elastic Net, L1-penalized Logistic regression (for case-control conditions), and PRScs with LD matrix information from either UKB or 1KG. It is important to note that the results presented here for AUC and correlation are for purely genetic PGS. In brief, phenotypes are first regressed on covariates (such as age, sex, and the first Table 1. The 11 phenotypes studied in this work and their relative sparsity. As described in Section "Sparsity" predictor sparsity can be defined in a variety of ways. Here we color-code the various traits according to the order of magnitude of SNP sparsity. Numbers in parentheses are the approximate number of SNPs used in a LASSO trained predictor using the maximum amount of data from the UKB. This definition of sparsity is consistent with the number of SNPs with non-zero weights found in previous publications 11,12 . www.nature.com/scientificreports/ 20 genetic principal components) -this allows them to explain as much of the variance as possible so that we can conservatively estimate purely genetic effects. Only then are SNP predictors are trained (using genotypephenotype data or summary statistics in the case of PRScs) on residual phenotypes. Further details about the training and evaluation of PGS can be found in Section "Methods". In Fig. 1 we see the comparison results for asthma and height. Similar plots for the other phenotypes can be found in the Supplementary Information. Ancestry groups SAS, AFR, EAS, and EUR result from UKB definitions of self-reported ancestry (although training, as described in Section "Methods" involves a principal component, or PC, adjustment). AMR refers to an American-like group constructed via principal component clustering detailed in the Supplementary Information and similar to that found in 7 . Sib refers to a set of white siblings (i.e. every member of the set has at least one sibling also in the set) where the ancestry is self-reported, but the sibling status is determined by a genetic analysis as detailed in the Supplementary Information. This sib-set attempts to partially control for environmental effects as described in 64,65 . It also allows for performing sibling selection experiments as described in Section "Methods". All results reflect training on a EUR population and then applied to a set of siblings or a different ancestry group, not used in training. The bands for TMPI/AoU are based on projections for future biobanks-this is described in detail in Section "Biobank projections".
Uncertainty (error bars) depends on cross-validation (i.e. multiple training sets), finite size effects from computing AUC/correlation, and from sample sizes. Details about uncertainty calculations are given in the Supplementary Information. For case-control conditions this can lead to error bars that are the same size as the central values. However, for continuous phenotypes with much larger sample sizes this is not the case. For example, compare the AMR group on both plots in Fig. 1. On the left, the AUC error bar overlaps 0.5 (i.e. consistent with no signal), while on the right, the correlation error bar is relatively small.
Comparing performance across ancestry groups, we see a well known fall-off behavior which is observed when predictors trained in one population are applied to another (e.g., see 7,52 ). The amount of fall-off is phenotype specific and ranges from complete fall-off, e.g. diabetes fall-off from sib to AFR, to negligible fall-off, e.g. breast cancer AUC from sibs to EAS. The relative order of fall-off is also phenotype specific. For most traits, the (EURancestry) sibling set shows the least reduction-then either SAS, EAS, and AMR,-and finally AFR. However, there are clear exceptions like the direct bilirubin correlation which goes from largest metric to smallest: Sib, SAS, AFR, EAS, AMR. There have been recent arguments that PGS fall-off is roughly linear as a function of local genetic distance 52 . We should note that this claim is not necessarily in conflict with the results we present here. First, there are many exceptions to this general linear behavior, e.g., Figure 5 in 52 . Second, the claimed linear fall-off is a function of Euclidean distance in PC space of training population. For genetically distant groups, the axes of variation will be different: any measure of PC distance is therefore a local measure. In other words, only ancestry groups that are near-enough to the original population where PCs were computed can be considered well-ordered in terms of genetic difference. These effects can also be exacerbated by the fact that these are all sparse predictors and, after projecting onto PC space, the order of Euclidean genetic-distance may change.
The performance of PGS can always be confounded due to environmental factors, interactions between genes and the environment, and non-linear genetic effects (e.g., epistasis). To attempt to guard against some of these effects we can perform sibling tests similar to those described in 64,65 . Genetic siblings can be assumed to have, on average, a more similar environmental background than unrelated individuals. However, there can be a competing effect from the enhancement of signals from effects like genetic nurture 66 . For case control conditions we can create affected sibling pairs (ASPs) where one person is a case and one is a control. Then we can ask what fraction of the time does the higher PGS correspond to the case vs control. We can also condition this question on the PGS difference being larger than some cut-off (e.g., 1.5, 2, or 2.5 standard deviations). For continuous phenotypes we can simply compare the fraction of the time where the person with the higher PGS also has the higher phenotype value. We can again condition this question for siblings whose phenotypes are separated by a cut-off (e.g., 0.5, 1, or 1.5 standard deviations). These selection rates for asthma and BMI are shown in Fig. 2 as an example; the results for all traits considered in this work can be found in the Supplementary Information. Again, we see that www.nature.com/scientificreports/ while several methods are extremely competitive, LASSO is regularly among the best performing methods. For larger and larger cut-offs, the selection rate improves, but the associated uncertainty also increases largely because of decreasing sample sizes. Interestingly, while PRScs performed similarly to other methods in terms of AUC and correlation, it routinely underperforms methods training directly on genetic data in sibling selection tests. PGS for case-control conditions can be converted to more clinically interpretable metrics. In Fig. 3 we see an example of an inclusive odds ratio (OR) for asthma. It is inclusive in the sense that the OR corresponds to the ratio of the cases to controls (normalized by the ratio of the total cases to controls) at a specific PGS value or above. As before uncertainties are conservative and include contributions from multiple cross-validation folds and finite size effects. Similar plots for the other case control conditions can be found in the Supplementary Information. Because LASSO routinely performed among the best predictors in terms of AUC and correlation we only present OR plots for this method. Analogous plots for the other methods can be generated similarly, although for all 7 case-control traits it leads to 357 plots which can be difficult to interpret. An initial interpretation of these results is that at extreme values of PGS there are large increases in OR: for asthma, within ancestry testing (i.e. the sibling group) leads to 2 < OR < 2.5 at large PGS. When testing on other ancestries, an optimistic interpretation is that asthma OR ranges from 1.25 to 3.5 at large PGS. While this is encouraging, we also urge caution. The extremes of the PGS distributions are the regions where model assumptions are most likely to break down, e.g., the linearity of SNP effects. Additionally, the sample sizes in these regions are smallest which leads to large uncertainties and difficulty interpreting the results. In addition, odds ratios are difficult to model in the presence of non-Gaussian distributions. The inclusive odds ratio can be written as a ratio of cumulative distribution functions of cases and controls. Similarly the PGS percentile can be written as an integral over the sum of the probability distribution functions for cases and controls.

Figure 2.
Left: affected sibling pair (ASP) selection rate for asthma. Pairs of siblings, where one person is a case and the other a control, are used and the rate corresponds to the number of times the case sibling has the higher PGS. The rate of correct selection, and uncertainty, increases if the siblings are also separated by at least 1.5, 2, or 2.5 standard deviations in PGS. Right: rank order selection rate for BMI respectively. The rate corresponds to frequency of the sibling with the larger BMI also having the larger PGS. Again the selection rate, and uncertainty (due to reduced statistics), increase if the sibling BMI is required to differ by at least 0.5, 1, or 1.5 standard deviations. Similar results for the other phenotypes are found in the Supplementary Information. These tests were developed in 64,65 . More detailed descriptions of these tests and how siblings are defined can be found in Section "Methods" and in the Supplementary Information. www.nature.com/scientificreports/ Biobank projections. This section details how training at various sizes in current biobanks can be used to model the growth dependence on training size. For most clinically relevant metrics, e.g., correlation and AUC below, this growth can be modeled with uncertainty. Using relatively simple parametric functions, we find that we can model the growth of these metrics. The results can be used to guide future studies and genotype-phenotype database construction by identifying where large gains can be made. For sparse methods like LASSO, it has been previously shown that using self-reported ancestry performs similarly to principal component based clustering when based on AUC and correlation metrics 11 . Nonetheless, after sorting on self-reported ancestry, we perform an additional regression on the top 20 PCs to adjust for any remaining population stratification.
To be conservative in the projections for PRS performance in other biobanks, several modeling assumptions are made. First, to be conservative we assume that these biobanks will have population prevalence rates for diseases even though some biobanks over-recruit cases to enrich their datasets. Additionally, the actual incidence rate for disease conditions fluctuates over time. For the conditions considered in this paper we try to consider the most recent surveys of all ages. Finally, when there are various estimates for disease prevalence within ancestry sub-groups, we choose a conservative (i.e. low) prevalence as the representative for the overall ancestry prevalence. Specific details about within ancestry prevalences are given in the Supplementary Information. Preliminary training using TPMI and the LASSO method described here was presented at ASHG 2022 67 and is consistent with the projection function predictions.
In Fig. 4 we see projection bands for asthma and BMI. The bands grow from no signal (0.5 AUC and 0 correlation respectively) to asymptotic values. The various colored prediction bands correspond to the Monte Carlo (MC) confidence intervals for the various fit functions as described in Section "Methods". The asymptotic predictions for each trait can be averaged, incorporating the confidence intervals, and the results are given in Table 2. Further projection plots for the remaining traits can be found in the Supplementary Information.
The fraction of phenotypic variance captured by SNPs (i.e., the linear, narrow sense heritability) of traits is traditionally estimated via means such as Restricted Maximum Likelihood estimates (REML) and using Linkage Disequilibrium Score Regression (LDSR). For continuous traits, the correlation of the residual phenotype with the PGS can be related to the linear SNP heritability explained by the predictor by simply squaring the correlation. In Table 2 this can be seen in the final column. The heritability explained by the predictor is a lower bound on the REML heritability in that there may not be enough data to saturate the REML estimates. In Fig. 4 we can see the heritability estimates from GCTA (using REML) and LDSR converted to a correlation scale. To make this comparison, we assume that a phenotype is determined by the sum of a genetic component and another uncorrelated random component (i.e., P = G + E). In this case, the heritability is simply the square of the correlation between P and G. This allows a direct comparison of the new correlation results presented here and traditional methods like GCTA and LDSR. For traits like BMI and height it appears that, eventually, sparse predictors will capture all the linear SNP heritability. For much sparser conditions, e.g. Lipoprotein A presented in the Supplementary Information, it appears that sparse methods are out performing traditional measures of heritability.
Sparse output interpretation. An advantage of sparsity is that, because there are fewer features, it is relatively easier to categorize features compared to non-sparse methods. Here we identify important features for predictors for each trait. Because a simple LASSO routinely performed as one of the best predictors in terms of AUC and correlation in Section "Comparison of sparse predictors" we focus here on interpreting the LASSO outputs.
In Fig. 5 we see an example of the SNP content for an asthma predictor as it is trained with larger and larger training sizes. The LASSO weights β , single-SNP-variance (SSV), and training sizes are all further described in Figure 4. Growth of AUC (left: asthma) and correlation (right: BMI) as a function of training size in the UKB. Colored, curved bands come from fitting data with various 4 parameter functions. Width of the band corresponds to a confidence interval on the predictions: on the left 2 standard deviations or ∼ 68% and on the right 4 standard deviations or ∼ 95% . Vertical bars represent projections for de novo training in other biobanks using literature prevalences, summarized in the Supplementary Information. If one assumes that a phenotype is determined by the sum of a genetic component and another uncorrelated random component (i.e., P = G + E), then the heritability is simply the square of the correlation between P and G. On the right, this apprximation is used to convert heritability predictions from GCTA and LDSR to horizontal correlation bands. www.nature.com/scientificreports/ Section "Methods". A notable feature from this figure is that, as training size is increased, the LASSO algorithm first adds more SNPs (becomes less sparse) while barely increasing the AUC. Eventually the algorithm gets rid of most of these SNPs, becomes much more sparse, and then quickly increase AUC as SNPs are then added again. This is an example of the algorithm "searching" for seemingly important features. Once some of the important features are identified, the algorithm finds more and more features that are important as evidenced by the rising AUC. Note that, as described in Section "Methods", LASSO does not try to directly optimize AUC. This sparsity behavior (rising and falling before eventually finding important features that greatly increase AUC/correlation) appears in the rest of the predictors seen in the Supplementary Information and appears to be related to the Donoho-Tanner phase transition. This behavior can be seen in other features of each predictor-e.g., the SSV-as displayed in Supplementary Information. The Donoho-Tanner phase transition [68][69][70][71] comes from the field of compressed sensing and generally refers to sharp behavior in the recovery of a sparse signal in underdetermined systems. In simple language, signal recovery or predictor construction changes sharply as relevant parameters change. In genomics, key parameters are trait heritability, sparsity (i.e., number of important variants affecting the trait), and the sample size. In the case of genomics, crossing the phase transition boundary does not guarantee that true causal SNPs are found. For modern applications to this phase boundary for the LASSO see, e.g., 72,73 and for its application to genetics see 74 .
At the largest training sizes-i.e., after the phase transition-we also look at how the SNP content varies across cross-validation (CV) folds. As detailed in Section "Methods", for case control phenotypes the largest training size includes as many controls as possible while all previous training sizes contain an equal number of cases and controls. In the Supplementary Information we have examples of the SNP content for the different types of large training sizes. For the largest training sizes we can also average over the CV folds to find the fraction of SSV per chromosome. Plots for SSV per chromome for each phenotype can be found in the Supplementary Information. There are several important takeaways from these results. The main metric for case-control phenotypes is much more affected by the number of cases than controls. Specifically, increasing the number of cases used in training increases the AUC, but increasing the number of controls in training, even by a factor of 2 or more, either does not change the AUC or leads to a very minor change. However, when looking at the single SNP variance, the number of both cases and controls do have appreciable effects. Take for example asthma. If we look for common features among the folds in maximal training (i.e., using all possible cases and controls) we see a modest jump in SSV at specific locations on chromosome 1, 2, 5, 6, 9, 10, 11, and 17. Additionally the increase in SSV on chromosome 17 is much larger on a single fold. In contrast, we also perform near-maximal training with the maximum number of cases and equal number of controls. Near-maximal training results in high impact regions (regions where the SSV increases by at least a few percent) at places similar to those found in maximal training (i.e., all cases and controls used), but the fraction of SSV at the beginning of chromosome 9 is much larger and the effect on chromosome 17 seems to be more smoothed out. This feature, which contrasts the two cases, is also seen when looking at the SSV fraction per chromosome. These plots, and the corresponding plots for the other phenotypes, are found in the Supplementary Information.
To interpret these plots we have collected and identified every SNP that accounts for at least 1% of total SSV. These tables are found in the Supplementary Information. Here we review the most important SNPs and compare them to known results, i.e. we identify genes, associated with the particular phenotypes, that are within at least 2 million base pairs of a SNP accounting for > 1% SSV. We choose a 2 million base pair distance to be a conservative measure of possible long-range LD 75 . All predictors, except those for Lipoprotein A, include SNPs that account for at least 1% of total SSV but are not near any known gene that associates with the phenotype. Additionally, we list the p-value associated with a GWAS on the raw (case control) or adjusted (continuous) phenotype to highlight that many of these important LASSO SNPs would be missed by a traditional GWAS approach.
There are several interesting aspects of examining SNPs in this manner. First we note that we are interested in common variants and exclude SNPs with a minor allele frequency below 0.001 to avoid any spurious associations. Because of this, rare variants can't appear in our analysis, even if they are known to be associated with a phenotype. An example of this can be seen in the case of breast cancer where the BRCA mutations aren't included in our analysis. Interestingly there are SNPs that have previously been identified via GWAS, that are available on our array, but are not selected by LASSO. An example would be rs116716490 which is part of the ZBTB10 gene, previously associated with asthma via GWAS 78 , but not selected by LASSO. A more coarse grained interpretation of the impactful regions can be found in the Supplementary Information. There, we examine-for each trait-the fraction of SSV that resides on each chromosome. For case-control phenotypes, we also compare the result for max possible training (i.e., the max possible cases and controls), and training with the largest possible equal number of cases and controls. We highlight some of the notable results. For atrial fibrillation, max training and equal case control training both find a large fraction of SSV on chromosome 4, but the signal is much larger for equal cases and controls. On chromosome 15 there is a large signal for max training, but not for equal cases and controls. For type 2 diabetes a large fraction of SSV is on chromosome 10, but the signal is largest for equal cases and controls. For type 1 diabetes. equal cases and controls find a strong signal on chromosome 6 while max training also finds signals on chromosomes 1, 11, 15, and 17. For CAD, the equal case control training finds the largest signal in chromosome 6 while max training finds similarly large signals in chromosomes 1, 2, 3, 6, and 12. Hypertension shows varying signals all throughout the chromosome with the most precise signal coming from chromosome 1. Breast cancer has the most diverse differences between the two types of training with large signals for equal case control training on chromosomes 10 and 16, and large signals on 7, 11, 16, and 19 for max training. For continuous phenotypes we only have one measure of SSV per chromosome. For direct bilirubin the largest contributions are on chromosomes 1 and 2. For BMI the largest signals are on chromosomes 1, 2, and 3. For height there is strong signal throughout most of the genome. Finally for Lipoportein A, the signal seems to be concentrated on chromosome 6. Fig. 6 where the definitions themselves are explaind in Section "Sparsity". The traits are roughly grouped according to their heritability estimate using GCTA 99 . Using all metrics together, the scatter of data can be regressed linearly in log 10 scales in both training size N and sparsity s. We find log 10 (s) = 0.7 (0.1) log 10 (N) + 0.1 (0.5) (the black line in Fig. 6) with an r 2 = 0.43 . This weakly implies s ∼ N 0.7 (or conversely N ∼ s 1.4 ). In 74 it was shown, using simulated data, that for h 2 = 0.5 (where h 2 represents the narrow-sense heritability) the compressed sensing phase transition occurs at N ≈ 30 s . The result reported here is consistent with this previous prediction, but tighter error bars are required to completely determine the coefficient and its dependence on h 2 .

Sparsity and heritability. We can see examples of all the definitions of sparsity in
Finally we can see estimates of heritability using a variety of metrics in Fig. 7 and in the Supplementary Information. In Fig. 7 we see examples of these heritability estimates, from a LASSO training, as a function of training size. (Additional phenotypes are in the Supplementary Information). Additionally, for the largest training size, we also record these heritability estimates for the other training methods. As a function of training size we generally see the phenotype dependent phase transition behavior noticed before. After a critical amount of training data is used, the error bars for all methods greatly decrease. Identifying this transition behavior is important; it indicates that central values below the critical training size may suggest a deceptively high estimate. For most phenotypes-regardless of training size, metric, or method-the estimated heritability is below what can be estimated from GCTA. However, there are a few exceptions: for asthma and type 1 diabetes, at the largest training size PRScs using the metric outperforms GCTA; for height the • metric generally, and PRScs at largest training, outperforms GCTA; and for Lipoprotein A, every metric and method except PRScs results in a higher value than GCTA.

Methods
Predictor training. We start with a very brief description of the general training pipeline for generating predictors: step (1) initial populations are separated into groups; step (2) quality control (QC) for both phenotypes and genotypes; step (3) phenotypes are regressed on covariates and adjusted phenotypes are built; step (4) SNP set is filtered down to a computationally manageable size; step (5) machine learning on adjusted phenotype and genotypes using cross-validation; step (6) predictors are tested on withheld groups. As an initial grouping of UKB data, we separate participants using self-reported ancestry (a principal component adjustment is done below). We use participants reporting ancestry as White (EUR), South Asian (SAS), East Asian (EAS), and Black (AFR). Additionally we construct an American ancestry (AMR) set done using the approach from 7 and described in the Supplementary Information (there is a small 60 person subset of the 322 AMR labeled participants who also self identify as white and could appear in the training. This set is so small that its effects are assumed to be negligible). From the EUR population we identify genetic siblings as described in the Supplementary Information and remove them to use as a final testing set. The remaining EUR population is used for training.
Step 2 involves performing quality control on both the genotypes and phenotypes. Full phenotype definitions, including UKB codes, are given in the Supplementary Information. For phenotype quality control we exclude any missing values or negative values (usually used as a placeholder or indicator in the UKB). For continuous Figure 6. Sparsity measurements, as a function of training size, for all 11 traits. Different markers correspond to different (arbitrary) estimated heritability groupings. Different colors correspond to different versions of sparsity. Heritablity here for case-control phenotypes is broad sense heritability reported from twin/family study literature, whereas GCTA was used to estimate heritability for continuous phenotypes. Low heritability traits (circles) include: atrial fibrillation, breast cancer, and BMI 100,101 . Medium heritability traits (squares) include: CAD, hypertension, direct bilirubin, height, and lipoprotein A 102,103 . High heritability traits (triangles) include: asthma and type 1/2 diabetes 104-106 .  Step 3 can be briefly described as sex-specific z-scoring and covariate adjustment. The z-scoring is only done on continuous phenotypes while covariate adjustment is done for all phenotypes. We z-score to improve the efficiency of the machine learning algorithms (e.g., using normalized data lowers the risk of large numbers appearing in a gradient descent algorithm). If we assume that we are looking for common genetic factors that are independent of sex then we can z-score each sex individually and roughly double our training data. The ultimate aim is to identify genetic variants that we are most confident are related to a phenotype. To do this we assume that common covariates and population stratification have a maximal effect. That is, we regress covariates on the raw (or z-scored) phenotype and then adjust (i.e., create a residual phenotype) phenotypes for the contribution explained by these covariates. Common covariates included are: age, sex (except for sex specific traits like breast cancer), and the top 20 principal components as computed by the UKB. Even though we do a sex specific z-scoring, at this stage we still assume sex can have an impact. That is, for phenotype, y , and covariates, H , we regress: � y ∼ � αH . Then we can construct an adjusted phenotype, y adj , that just includes the residual signal: � y adj = y − � αH. Unfortunately, before we employ a machine learning algorithm, we have to reduce down the ∼ 600k SNPs to a computationally manageable number. The exact computational details will depend on the machine being used to run the analysis and whether or not computational cost saving measures can be used (e.g., parallelization). For our analysis, as shown in the Supplementary Information, the time to run lasso scales as a function of training data, N, roughly as ∼ N 1. 35 . Additionally, the memory and CPU usage, while growing more slowly than exponential, grows quickly. For step 4, we choose to subset our SNP set by selecting the top 50k SNPs (10k for penalized logistic regression which is more computationally intensive) via GWAS with the training set.
Next, in step 5 we run machine learning algorithms. For LASSO and Elastic Net we use the Scikit-learn 107 lasso path linear_model.lasso_path and enet path linear_model.enet_path algorithm. The Elastic Net algorithm minimizes the objective function, where X is the normalized genotype matrix, β the regression weights, the hyperparameter, L 1ratio relatively weights the L 1 and L 2 penalties. As L 1ratio → 1 this becomes LASSO and L 1ratio → 0 it becomes ridge regression. For all phenotypes we run for 5 different ratio weights: L 1ratio ∈ {.1, .3, .5, .7, .9} . For L 1 penalized logistic regression we use the Scikit-learn function linear_model.LogisticRegression with an L1 penalization and parallelize it using the Multiprocessing package Pool function. For all methods of (penalized) regression we use five-fold cross validation and use a 2,500 sample validation set (withheld from training) for hyperparemeter selection. For case-control phenotypes, the validation set is equal number of cases and controls. Finally, we also use PRS-cs 108 which is a Bayesian shrinkage prior which runs directly on summary statistics (i.e., GWAS output) and an LD matrix. We use three fold cross validation and run with the global shrinkage/sparseness parameter, φ ∈ {10 −1 , 10 −3 , 10 −5 , 10 −7 , 10 −9 } and the local scale parameters {a, b} ∈ {1/2, 1, 3/2}.
In the final step 6 we apply the predictors to all testing sets: the EUR genetic siblings set, an SAS set, an EAS set, an AFR set, and an AMR set. Case control phenotypes are evaluated by computing the Area Under the receiver operator Curve (AUC) which compares the true positive rate to the false positive rate, i.e. how well the predictor is calling cases and controls. Within this work we generally report the AUC only using the genetic weights to emphasize how well cases and controls can be identified using only genetic information. For case control phenotypes we can also compute an odds ratio, as a function of polygenic score, as seen in Fig. 3. This is an inclusive odds ratio: at each score, we compute the ratio of the number of cases to controls (normalized to the total cases and controls) of all individuals with that score or a higher score. I.e., the 80th percentile represents all individuals with a score ≥ 80% . For continuous phenotypes we simply report the correlation between the polygenic score and the adjusted phenotype, Corr(PGS, y adj ) , and again this estimates the purely genetic contribution.
In addition to standard metrics like AUC and correlation, we are often interested in the amount of variance described by a particular polygenic predictor. For a particular polygenic score, the variance is described by www.nature.com/scientificreports/ where the approximation defines the Single SNP Variance (SSV). This approximation assumes that the covariance between SNPs is small. For most sparse methods, this covariance is minimized to enforce sparsity. For example, in the objective function Eq.
(1), adding extra SNPs with non-zero weights amounts to decreasing the first term and increasing the second term. For an added SNP that is highly correlated with another SNP with non-zero weight, the decrease in the first term is likely smaller than the increase in the second. The accuracy of this approximation is demonstrated in the Supplementary Information.
Sibling tests. For both types of phenotypes there are sibling specific tests that we perform. These were developed in 64,65 . All siblings used in these analyses are genetic siblings identified using KING 109,110 and filtered in a manner similar to 110 . More detail is provided in the Supplementary Information. Siblings generally share a similar environment growing up which helps to control for some external factors. For case control phenotypes we can consider Affected Sibling Pairs (ASPs), which are pairs of siblings where one person is a case and the other a control. We can then look at the correct selection rate, that is the fraction of the time the person with the larger polygenic score corresponds to the case. This computation can be done again while requiring that the sibling pair's scores are at least 1.5, 2, or 2.5 standard deviations different. For continuous phenotypes we show a similar selection rate for the amount of time the person with the larger polygenic score has the larger phenotype. Again this rate can be recomputed with the requirement that the phenotype difference between the siblings is at least 0.5, 1, or 1.5 standard deviations.
Metric projection. We can model improvement in predictor performance metrics as a function of training data size. In Fig. 4 we see examples of this for asthma and BMI. We use four functions, which have left and right asymptotes, to model this growth: sigmoid, inverse tangent, error function, and hyperbolic tangent. The hyperbolic tangent can be written as a rescaled version of a sigmoid function. We use that as a cross-check, i.e. we make sure both functions give the same results, to avoid fitting routines getting stuck in a local minimum. For all functions, four parameters are used to fit the performance, e.g., erf4(x; a, b, c, d) = a + b erf(c(x + d)) . These parameters are subject to physical constraints. Roughly this corresponds to 0.5 < AUC < 1 , 0 < Corr < 1 , metric growth is a function of training size, and the growth happens when training with multiple samples. Care is taken to incorporate uncertainty from cross-validation and finite data sizes. More details about the functions, uncertainty calculations, and fitting results are given in the Supplementary Information. The non-linear fit of each functional form involves computing a Hessian matrix as a function of the parameters {a, b, c, d} . The inverse of this matrix is the correlation matrix for the model's fit parameters. Using these empirical correlations we build a MC method that relies on a Cholesky decomposition. For a given correlation matrix C of fit parameters, we can generate a correlated set of MC parameters, Y , from random numbers, X , via a Cholesky matrix, L: We can then build MC bands as seen in Fig. 4. For both types of phenotypes, these projection bands are informative: they indicate (1) the asymptotic (i.e. best possible) metric obtained from infinite data, as in Table 2 (2) which traits are nearing that asymptotic maximum or which are still improving substantially with more training data, and (3) what sample size other biobanks will need to obtain meaningful results. These projection bands can be compared to the application of European trained predictors applied to other ancestry groups. In Fig. 1 we see that predictors built using data from TPMI and AoU will greatly surpass the results of UKB trained predictors when applied to distant ancestry groups.
More details can be found in the Supplementary Information and in the code examples found in the affiliated Github repository.
Sparsity. The sparsity associated with a phenotype can be defined in several ways and it is not a priori clear which definition is most appropriate. The definitions used here are: 1. The number of SNPs with non-zero weights using all possible cases and controls. This quantity can appear to continue to grow without reaching an asymptote even though traditional metrics (e.g. AUC or correlation) appear to asymptote. However, this definition is obviously impacted by training sample size and difficult to compare between datasets/biobanks. Assuming infinite data generated by a linear model plus noise, compressed sensing guarantees complete signal recovery and the sparsity will reach an asymptotic value corresponding to the underlying model that generated the data 74 . Additionally, for algorithms like LASSO there is debate as whether to use the maximal metric for hyperparameter selection or to step one standard deviation "back" (i.e. towards fewer features) in hyperparameter space to avoid over-fitting when applying to outside true testing sets. (see 11 and references therein). 2. We can use the previous definition but for the training case with the maximum number of equal cases and controls. The same caveats from definition 1. apply. 3. Because we expect traits not to be 100% heritable, we expect metrics to asymptotically approach a limit value.
We can look at relative increase in these metrics to try to identify the point of largest growth (or above some cut), i.e. the inflection point. We can do this by looking at a relative metric, i.e. (y i+1 − y i )/y i . We can look at the sparsity of the predictors at this point as a second version of sparsity. 4. Instead of looking at the metrics, we can instead look at the relative increase in the number of features (SNPs) at each training size and look for the relative maximum increase (or increase above some cut) in features to define a sparsity value. www.nature.com/scientificreports/ Comparisons for these definitions can be seen in Fig. 6.
Fraction of variance explained. Heritability, H 2 , is traditionally defined as the proportion of phenotypic variance explained by genetic factors. All the work discussed here involves linear genomic prediction, i.e. that PGS ≡ y PGS = � β ·X . The proportion of variance explained by a linear model is generally referred to as the narrow-sense or linear heritability, h 2 linear . There are several ways we can estimate the fraction of variance explained via the results of a linear predictor. First, we note that our linear predictors are trained after sex specific z-scoring. Because of this, the overall scale of y PGS (i.e. the variance) is expected to be of the order of the adjusted phenotype, not the original raw phenotype. Therefore, we can consider the ratio of the variance of the PGS to the adjusted phenotype ( denotes var(pgs)/var(y adj ) ) or we can undo the sex specific z-scoring only on the PGS and compute the ratio of variance in this rescaled PGS to the raw phenotype ( • denotes var(y * pgs )/var(y raw ) ). Finally, for continuous traits we can simply look at the correlation between pgs and adjusted phenotype as in 11,51,64 .
In both Section "Biobank projections" and in Section "Sparse output interpretation" we refer to the heritability estimated via REML and LDSR. The LDSR results are reported from 111,112 . The REML results are produced using the GCTA software 99,113  Public Data. Raw data is not available for direct sharing, but can be obtained via application to the UK biobank (https:// www. ukbio bank. ac. uk/ enable-your-resea rch/ apply-for-access). Predictors (i.e. SNPs and weights) and some code used to produce results can be found at https:// github. com/ MSU-Hsu-Lab/ bioba nkscale-metho ds-paper-2023/.

Discussion
Development of a large genetic database or biobank is a significant endeavor. To ensure that these efforts are focused on phenotypes and recruitment sample targets that will produce the most beneficial results, we have developed a novel method for projecting polygenic score performance to the size of novel biobanks. We make specific predictions for TPMI and AoU based on their currently stated recruitment goals and known population prevalences for various phenotypes. These and other biobanks are still recruiting and genotyping participants; hopefully these results can be used to plan recruitment targets and drive new analyses.
Sparse methods optimize prediction while activating as few features (SNPs) as possible during training. In contrast, non-sparse methods construct predictors in which potentially every SNP in the genome has non-zero (but possibly very small) effect size. Because sparse predictors perform about as well as predictors constructed using non-sparse methods [114][115][116] , it is reasonable to conclude that actual genetic architectures are themselves sparse. For all known complex traits and disease risks, only a small fraction of common SNPs are required to build a predictor which performs nearly, or equally, as well as any non-sparse predictor.
In this paper we analyzed many aspects of sparse predictors. Within the class of algorithms that produce spare predictors we find that LASSO, also known as compressed sensing, is competitive with other more complex techniques. In fact, methodological improvements in predictor training, although important, generally produce improvements which are relatively small (in rough terms, of order 5 or 10%), whereas increases in data size are likely to produce much larger gains. Additionally, we demonstrate the performance characteristics promised from the compressed sensing literature. Specifically, we demonstrate phase transition behavior in performance using actual genotype/phenotype pairs-above a certain data threshold we recover the SNPs which provide the strongest contribution to performance metrics. Additionally, techniques to improve predictor portability (e.g., 52,54,[54][55][56][57][58][59][60][61] ) rely on understanding the SNP content of the original predictor. This issue is considerably simplified if the predictors are trained using sparse algorithms to narrow down the possible resulting SNP set.
We develop a methodology for projecting the performance of LASSO on larger datasets. This method can be applied to anticipated future biobanks and to analysis which is forthcoming on existing biobanks. Specifically, we project that a predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of 0.63 (0.02) and for height a correlation of 0.648 (0.009) for a Taiwanese population. For comparison, the measured values are 0.61 (0.01) and 0.631 (0.008) , respectively, for UK Biobank trained predictors applied to a European population. We also show that, in terms of AUC, atrial fibrillation, type 2 diabetes, CAD, and breast cancer will more than double their signal if trained on larger datasets.
Other researchers have investigated the question of how prediction and association results depend on sample size and have tried to predict the results for future large sample sizes. We highlight here some lines of research in this direction so that we can emphasize how our new analysis is different. First, most of the previous work involves GWAS based methods (e.g., 95,[117][118][119][120][121][122][123] whereas this work involves machine learning algorithms that go beyond single marker regression. Secondly, much of the previous work makes assumptions about how performance metrics, e.g. AUC or Area Under receiver operator Curve, depend on trait heritability, model parameter effect distributions, and their functional form. For example, in 124,125 , AUC is modeled with a cumulative distribution function for the Gaussian distribution and depends on the population variance of the PGS. Using these approaches, after fitting to the training of a single PGS, predictions can be made for other training sample sizes. In this work we make no such assumptions about the functional form of AUC or the scaling dependence of an underlying genetics effects model. Instead we simply assume that AUC is a finite and bounded quantity. We then use a series of different functional forms to model the growth of AUC. These functional forms are chosen because based on parsimony, i.e. that they have the fewest number of parameters for functions that satisfy the simple physical constraint. This is an advantageous approach when there are limited or small samples with which www.nature.com/scientificreports/ to train. In contrast, in this work for each phenotype we use 11 different training sizes to model the growth of performance metrics. In using such a large number of different training sizes, we developed a novel method for taking into account the uncertainty/error associated with the metrics when modeling the growth as a function of training size. Throughout this work we are primarily concerned with sparse methods that are trained directly on genomes. The one exception being the PRScs method, which is a summary-statistics (GWAS) based method. We stress that that our main results are concerned with training methods that directly use genome matrices themselves and might not generalize to the comparison to generic GWAS based methods. For example, although a simple LASSO regularly outperformed or performed comparably to PRScs, it is not necessarily true that this will be the case for all GWAS based methods with similar training sizes-especially if it includes additional information like fine-mapping, functional information, other -omics, etc. However, it has been found that for some phenotypes, e.g. height, methods like LASSO usually require much smaller sample sizes than GWAS based methods to find the same phenotype-PGS correlation 51,95 . GWAS based methods however have an advantage over direct genome training: it is relatively easy to combine GWAS summary statistics from disjoint studies to increase sample size. This is advantageous as it requires less individual computing resources for training while still increasing sample size. For example, meta-analysis of GWAS for height and BMI has found novel loci 122 . Some GWAS based methods, like PRScs, require information about the LD structure. Computing LD information can be computationally intensive, but it only has to be done once and then the results can be efficiently reused 126,127 .
One of the main challenges of polygenic prediction is that most of the available data comes from studies in which most of the participants are individuals of European ancestry. Thus the current predictors perform poorly when applied to other ancestry groups (e.g., East Asians or Africans). The hope of developing methods which "transport" a predictor trained on one ancestry to other ancestral groups remains, but despite ongoing research efforts the goal remains elusive. In the absence of such a breakthrough, much larger cohorts of non-Europeans are required to produce predictors of comparable quality. The new methods developed here allow us to predict, e.g., how well new predictors trained on biobank-scale data from the Taiwan Precision Medicine Initiative (which is planned to surpass 1 million genotypes) and the US All of Us project. In the former case, we predict that new predictors will exceed AUC / correlation metrics for current best-in-class predictors for European ancestry.

Data availability
Access to the UK Biobank resource is available via application (http:// www. ukbio bank. ac. uk). UKB data was collected under policies conforming with national and local requirements and subject to privacy rights. (https:// www. ukbio bank. ac. uk/ priva cy-policy) All data handled by researchers in this work was de-identified. MSU researchers working on this project do not have access to any identifiable data and are covered under MSU IRB [STUDY00006493].